Part-of-Speech Tagging for Code-Mixed English-Hindi Twitter and Facebook Chat Messages
Abstract
The paper reports work on collecting and annotating code-mixed English-Hindi social media text (Twitter and Facebook messages), and experiments on automatic tagging of these corpora, using both a coarse-grained and a fine-grained part-of-speech tag set. We compare the performance of a combination of language-specific taggers to that of applying four machine learning algorithms to the task (Conditional Random Fields, Sequential Minimal Optimization, Naïve Bayes and Random Forests), using a range of different features based on word context and word-internal information.
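The abstract does not list the actual features, so the following is only a minimal sketch of what "word context" and "word-internal" features for a code-mixed token might look like. The function names, the feature names and the example sentence are all hypothetical and are not taken from the paper.

```python
def token_features(tokens, i):
    """Build a feature dict for tokens[i].

    All feature names here are illustrative assumptions, not the
    feature set used in the paper.
    """
    word = tokens[i]
    lower = word.lower()
    return {
        # Word-internal information: shape and affixes of the token itself.
        "lower": lower,
        "prefix3": lower[:3],
        "suffix3": lower[-3:],
        "is_capitalized": word[:1].isupper(),
        "has_digit": any(ch.isdigit() for ch in word),
        "is_hashtag": word.startswith("#"),
        "is_mention": word.startswith("@"),
        # Word context: neighbouring tokens, with boundary markers.
        "prev_lower": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next_lower": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }


def sentence_features(tokens):
    """Return one feature dict per token in the sentence."""
    return [token_features(tokens, i) for i in range(len(tokens))]


if __name__ == "__main__":
    # Hypothetical code-mixed message, already tokenized.
    example = ["@ravi", "yeh", "movie", "Awesome", "thi", "#2015"]
    for token, feats in zip(example, sentence_features(example)):
        print(token, feats)
```

Feature dicts of this kind could then be vectorized and passed to any of the classifiers named in the abstract; which features actually help on code-mixed text is an empirical question that the paper's experiments address.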
Similar resources
SMPOST: Parts of Speech Tagger for Code-Mixed Indic Social Media Text
Use of social media has grown dramatically fast during the past few years. Users usually follow informal languages in communicating through social media. This language of communication is often mixed in nature, where people transcribe their regional language with English. This technique of writing is increasing its popularity rapidly. Natural language processing (NLP) aims to infer the informat...
Recurrent Neural Network based Part-of-Speech Tagger for Code-Mixed Social Media Text
This paper describes the Centre for Development of Advanced Computing’s (CDACM) submission to the shared task ‘Tool Contest on POS tagging for Code-Mixed Indian Social Media (Facebook, Twitter, and Whatsapp) Text’, collocated with ICON-2016. The shared task was to predict the Part of Speech (POS) tag at word level for a given text. The code-mixed text is generated mostly on social media by multilingual us...
POS Tagging of English-Hindi Code-Mixed Social Media Content
Code-mixing is frequently observed in user-generated content on social media, especially from multilingual users. The linguistic complexity of such content is compounded by the presence of spelling variations, transliteration and non-adherence to formal grammar. We describe our initial efforts to create a multi-level annotated corpus of Hindi-English code-mixed text collated from Facebook forums, an...
A POS Tagger for Code Mixed Indian Social Media Text - ICON-2016 NLP Tools Contest Entry from Surukam
Building Part-of-Speech (POS) taggers for code-mixed Indian languages is a particularly challenging problem in computational linguistics due to a dearth of accurately annotated training corpora. ICON, as part of its NLP tools contest, has organized this challenge as a shared task for the second consecutive year to improve the state of the art. This paper describes the POS tagger built at Surukam...
POS Tagging of Hindi-English Code Mixed Text from Social Media: Some Machine Learning Experiments
We discuss Part-of-Speech (POS) tagging of Hindi-English Code-Mixed (CM) text from social media content. We propose extensions to the existing approaches and also present a new feature set which addresses the transliteration problem inherent in social media. We achieve an 84% accuracy with the new feature set. We show that the context and joint modeling of language detection and POS tag layers do...